Skip to content

Conversation

@fmeum
Copy link
Collaborator

@fmeum fmeum commented Oct 18, 2025

Only the outer buildForFiles/buildForSpawn entry points may block on futures now. This fixes deadlocks that could happen when all threads in MERKLE_TREE_BUILD_POOL block on futures scheduled on this pool.

The existing golden test now builds the same Merkle tree multiple times concurrently, which reproduces the deadlock before this change on virtually every run and passes in 100 attempts with the fix.

Fixes #27331

@fmeum fmeum changed the title Deadlock Don't block within futures in MerkleTreeComputer Oct 18, 2025
@fmeum fmeum marked this pull request as ready for review October 18, 2025 14:57
@fmeum fmeum requested a review from a team as a code owner October 18, 2025 14:57
@fmeum fmeum requested review from coeuvre and tjgq October 18, 2025 14:57
@github-actions github-actions bot added team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Oct 18, 2025
@fmeum
Copy link
Collaborator Author

fmeum commented Oct 18, 2025

@carmenchui This probably should be cherry-picked into the last rolling release together with an upcoming fix for #27332.

@fmeum fmeum requested a review from Copilot October 19, 2025 22:00
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR prevents deadlocks by avoiding blocking inside futures in MerkleTreeComputer, ensuring only the outer entry points block. It also adds a concurrency-focused golden test to validate multiple simultaneous Merkle tree builds.

  • Refactor MerkleTreeComputer to return CompletableFuture from internal build path and precompute subtrees upfront.
  • Adjust thread pool parallelism in tests and add a concurrent golden test to exercise potential deadlocks.
  • Update BUILD to include TestType dependency used for test-detection-based parallelism adjustments.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
src/test/java/com/google/devtools/build/lib/remote/RemoteExecutionServiceTest.java Adds a concurrent Merkle tree build test using virtual threads to reproduce deadlocks and validate correctness.
src/main/java/com/google/devtools/build/lib/remote/merkletree/MerkleTreeComputer.java Refactors build flow to be non-blocking internally, precomputes subtrees, and adjusts executor behavior and case handling.
src/main/java/com/google/devtools/build/lib/remote/merkletree/BUILD Adds dependency on TestType for test-detection logic in executor sizing.
Comments suppressed due to low confidence (1)

src/main/java/com/google/devtools/build/lib/remote/merkletree/MerkleTreeComputer.java:1

  • Restricting the symlink case to Artifact.SpecialArtifact excludes regular Artifact symlinks. This causes non-special symlink inputs to fall through and be processed incorrectly. Replace the case type with Artifact to cover all symlink inputs.
// Copyright 2025 The Bazel Authors. All rights reserved.

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link
Contributor

@tjgq tjgq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize this when reviewing your original PR, but use of CompletableFuture is strongly discouraged at Google, with Guava's ListenableFuture as the recommended replacement.

(There's an internal-only doc that I unfortunately cannot share with a litany of reasons, the most concerning being that CompletableFuture does not propagate cancellation correctly.)

Given that we have to restructure this code anyway, how difficult would it be to switch to ListenableFuture as well?

@fmeum
Copy link
Collaborator Author

fmeum commented Oct 20, 2025

The Caffeine async cache API uses CompletableFuture, which is why I started to use it here. Can I adapt it to ListenableFuture or should I write my own async cache logic based on a synchronous cache with ListenableFuture values?

@tjgq
Copy link
Contributor

tjgq commented Oct 20, 2025

The Caffeine async cache API uses CompletableFuture, which is why I started to use it here. Can I adapt it to ListenableFuture or should I write my own async cache logic based on a synchronous cache with ListenableFuture values?

I'm not sure. I think adapting would be preferable (less tricky code to write/maintain), but it seems that the only way Guava lets you do this is JdkFutureAdapters.listenInPoolThread, which consumes an additional thread per future being listened on? That... doesn't sound great.

@fmeum
Copy link
Collaborator Author

fmeum commented Oct 21, 2025

@tjgq I rewrote this with ListenableFuture and also replaced AsyncCache with just a ConcurrentHashMap.

@tjgq tjgq added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Oct 21, 2025
@github-actions github-actions bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label Oct 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

team-Remote-Exec Issues and PRs for the Execution (Remote) team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

deadlock with new merkle tree computer

2 participants